2025-03-24
Goal:
Semi-interactive walk-through of the process for collecting, preprocessing, and analyzing donated WhatsApp chat log data.
| Time | Block |
|---|---|
| 09:00 - 09:15 | Presentation: Overview of WhatsApp Chat Log Data |
| 09:15 - 09:45 | Code along: Exporting & Parsing WhatsApp Chat Log Data |
| 09:45 - 09:55 | Presentation: Anonymization & Consent Checking |
| 09:55 - 10:10 | Code along: Anonymization & Consent Checking |
| 10:10 - 10:20 | Presentation: ChatDashboard for Data Donation Studies |
| 10:20 - 10:45 | Code along: Installing and adapting ChatDashboard |
| 10:45 - 11:00 | Discussion, Q&A |
What you need to code along:
devtools::install_github("gesiscss/WhatsR")
Chat logs offer extraordinarily rich, high-quality data about everyday interpersonal interactions.
Linguistics (Ueberwasser & Stark, 2017; Verheijen & Stoop, 2016)
Alcohol Consumption (Jensen & Hussong, 2021)
Profane Language & Sexual Topics (Underwood et al., 2012)
Relationship Formation & Maintenance (Brinberg et al., 2021; Brinberg & Ram, 2021)
Network Traffic (A. Seufert et al., 2023; M. Seufert et al., 2016)
Communication during COVID-19 (A. Seufert et al., 2022)
Political Discussions (Caetano et al., 2018)
Misinformation & Fake News (Freitas Melo et al., 2020; Garimella & Eckles, 2020)
WhatsApp is the most popular mobile instant messenger (MIM) in the world (Kemp, 2025)
2 billion monthly active users (Kemp, 2025; Montag et al., 2015)
Available for Android and iOS
Unobtrusively logs interactions
Option for chat log exports
Retrospective, highly granular communication data
WhatsApp chat log data can be obtained in at least two different ways:
1) Joining the conversation
Researchers identify target conversations or groups
Researchers join the conversation
Researchers export the chat log data from the group using the WhatsApp export function
Advantages:
Disadvantages:
Data collection is not retrospective
More effort for researchers
Participants are either aware of being studied or not asked for consent
Method was first described in 2018 (Garimella & Tyson, 2018)
Approach was mainly used for investigating “public” WhatsApp groups in:
South-east Asia (Narayanan et al., 2019)
South America (Machado et al., 2019; Melo et al., 2019; Resende et al., 2019)
Can be semi-automated by scraping public invite links to further groups (Bursztyn & Birnbaum, 2019)
Other approaches are to:
Ask participants for consent to join private conversations (García-Gómez, 2018)
Create a new group and ask participants to join (Sprugnoli et al., 2018)
2) Data Donations
Researchers identify target conversations or groups
Researchers ask participants to export chat logs
Participants donate the exported chat logs to the researchers
Advantages:
Retrospective data collection
Full transparency for participants
Active, opt-in consent
Disadvantages:
Example applications of the approach:
Language use in multilingual conversations (Ueberwasser & Stark, 2017)
Teenage slang (Verheijen & Stoop, 2016)
Communication during COVID-19 (A. Seufert et al., 2022)
Interpersonal Relationship Research (Kohne & Montag, under review)
Social Network Analyses (Corten et al., n.d.)
Methodological Research (Hase & Haim, 2024)
Network Traffic (A. Seufert et al., 2023; M. Seufert et al., 2016)
WhatsApp users can export:
An individual chat with a person or a group
Exported directly to the person's phone or sent via a service of their choice
Unencrypted .txt file or .zip file
A complete backup file of all their conversations, including media files
Google Account or iCloud necessary
Saved as a backup file that cannot easily be inspected manually
Designed for data recovery, not data sharing
To our knowledge, no tools leverage this as a source of data
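To make the structure of an exported .txt file concrete, here is a minimal base-R sketch of how such lines could be split into timestamp, sender, and message. It is not WhatsR's parser: the sample lines, the `header` pattern, and the `parse_lines()` helper are hypothetical, and the pattern assumes the "dd.mm.yy, hh:mm - Sender: text" layout only. Real exports differ by operating system, language, and app version, and lines without a sender (such as system messages) would be glued onto the previous message here.

```r
# Hypothetical sample lines in the "dd.mm.yy, hh:mm - Sender: text" layout
chat_lines <- c(
  "06.04.19, 09:14 - Frank: Hello everyone",
  "06.04.19, 11:18 - Bob: I agree to donate my chat data",
  "and have read the study information."  # continuation line without a header
)

# A line starts a new message if it begins with date, time, sender, and colon
header <- "^(\\d{2}\\.\\d{2}\\.\\d{2}), (\\d{2}:\\d{2}) - ([^:]+): (.*)$"

parse_lines <- function(lines) {
  is_start <- grepl(header, lines)
  # Attach continuation lines to the most recent message start
  msg_id <- cumsum(is_start)
  merged <- vapply(split(lines, msg_id), paste, character(1),
                   collapse = " ", USE.NAMES = FALSE)
  # Extract the capture groups: full match, date, time, sender, message
  parts <- regmatches(merged, regexec(header, merged))
  data.frame(
    DateTime = as.POSIXct(
      vapply(parts, function(p) paste(p[2], p[3]), character(1)),
      format = "%d.%m.%y %H:%M", tz = "UTC"
    ),
    Sender  = vapply(parts, function(p) p[4], character(1)),
    Message = vapply(parts, function(p) p[5], character(1)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_lines(chat_lines)
print(parsed)
```

The multi-line handling is the part that matters most in practice: a message body can contain line breaks, so a line-by-line split would otherwise produce rows without a timestamp or sender.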
WhatsApp chat logs contain a lot of personally identifiable information (PII).
There are multiple good reasons to remove it:
Parsimony: Researchers should only work with data that they absolutely need
Ethics Boards: Getting approval from an ethics board is easier for anonymous data
Participation Willingness: People are more likely to share their data if it’s anonymized
Consent: Consent might only be necessary from the data donor, not from all participants
FAIR data: Data can be shared much more easily when it’s anonymized
However: Depending on the research question at hand, raw data might be necessary.
Essentially, there are two ways to anonymize data:
1) Delete the parts of the data that contain PII
2) Alter the parts of the data that contain PII
Aggregate
Pseudonymize
Reduce
Whenever feasible, researchers prefer option 2 because it retains as much information as possible.
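The two options can be illustrated with a small base-R sketch on a made-up chat. Everything here is hypothetical: the toy data, the `pseudonymize()` helper, and the "Person_" prefix. Reading "reduce" as keeping only the message length and "aggregate" as counting messages per sender is one possible interpretation, not a description of how WhatsR implements these steps.

```r
# Hypothetical toy chat
chat <- data.frame(
  Sender  = c("Frank", "Bob", "Frank", "Elli"),
  Message = c("Hi all", "Hello", "How are you?", "Fine, thanks"),
  stringsAsFactors = FALSE
)

# Option 1: delete the column that contains PII
deleted <- chat[, setdiff(names(chat), "Message"), drop = FALSE]

# Option 2, pseudonymize: replace each name with a stable placeholder,
# numbered by order of first appearance
pseudonymize <- function(x, prefix = "Person_") {
  paste0(prefix, match(x, unique(x)))
}
altered <- chat
altered$Sender <- pseudonymize(chat$Sender)

# Option 2, reduce: keep only a coarse property of the text (its length)
altered$Message <- nchar(chat$Message)

# Option 2, aggregate: summarize to messages per (pseudonymized) sender
per_sender <- as.data.frame(table(Sender = altered$Sender))

print(altered)
print(per_sender)
```

Because the placeholder is stable within a chat, analyses such as turn-taking or response times remain possible even though the names are gone.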
WhatsR

| Column Name | Description | PII | Anonymization |
|---|---|---|---|
| DateTime | Timestamp (yyyy-mm-dd hh:mm:ss) | no | none |
| Sender | Sender name (incl. system msgs) | yes | placeholder |
| Message | User message text | yes | deleted |
| Flat | Simplified message | yes | deleted |
| TokVec | Tokenized message (list of words) | yes | deleted |
| URL | URLs/domains | yes | domains |
| Media | Media filenames | yes | file ext |
| Location | Location URLs/indicators | yes | indicator |
| Emoji | Emoji glyphs | no | none |
| EmoDesc | Emoji text | no | none |
| Smilies | Smileys | no | none |
| SysMsg | System messages | yes | deleted |
| TokCount | Token count | no | none |
| TimeOrd | Timestamp order | no | none |
| DispOrd | Chat display order | no | none |
Several columns are completely unproblematic because they do not contain any PII
Some columns are problematic but can be anonymized
The columns containing the sent messages are highly problematic because they can contain any form of PII in any format.
While sophisticated software exists that could potentially anonymize free text, WhatsR deletes these columns during anonymization.
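For the columns that can be anonymized by reduction, the table's "domains", "file ext", and "indicator" entries can be sketched in base R as follows. The helper names (`to_domain()`, `to_extension()`, `to_indicator()`), the sample values, and the "location_shared" label are assumptions for illustration. WhatsR's own logic may differ, and the URL handling is deliberately simple (it does not cover URLs with embedded credentials, for example).

```r
# Reduce a URL to its domain: strip the scheme, then everything after the host
to_domain <- function(url) {
  no_scheme <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
  sub("[/?#:].*$", "", no_scheme)
}

# Reduce a media filename to its file extension
to_extension <- function(filename) tools::file_ext(filename)

# Replace a shared location with a plain indicator that a location was sent
to_indicator <- function(x) {
  ifelse(is.na(x) | x == "", NA_character_, "location_shared")
}

# Hypothetical sample values
urls      <- c("https://www.abc.de/study?id=1", "http://example.org:8080/a/b")
media     <- c("IMG-20190406-WA0001.jpg", "VID-20190406-WA0002.mp4")
locations <- c("https://maps.google.com/?q=50.9,6.9", NA)

print(to_domain(urls))
print(to_extension(media))
print(to_indicator(locations))
```

Each reduction keeps the analytically useful part (which site, what kind of media, whether a location was shared) while dropping the part that could identify a person.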
In data donation studies, we automatically get the consent of the data donor.
However, for WhatsApp data, the donation also contains information about other people
For anonymous data, it might be enough to only get the consent of the chat donor
For raw data, researchers should get the consent of all chat participants
However: Doing this can be very tricky because it is effortful and time-consuming
WhatsR has a built-in option to make this easier.
Researchers pick a predefined consent message string
Researchers instruct donors to post the consent string into the chat
Data donors ask all participants to repost the message if they consent
Data donors donate the data
All content from people who did not repost the message is deleted during parsing
Active opt-in consent from all chat participants
06.04.19, 09:14 - Frank: Hi everyone, I am taking part in a scientific study and am donating this chat log. Please repost the following text if you agree. If not, your data will be deleted automatically.
06.04.19, 09:38 - Frank: I agree to donate my chat data to study XYZ at University ABC. I have reviewed all study information here: www.abc.de and agree to it.
06.04.19, 11:18 - Bob: I agree to donate my chat data to study XYZ at University ABC. I have reviewed all study information here: www.abc.de and agree to it.
06.04.19, 11:48 - Elli: I agree to donate my chat data to study XYZ at University ABC. I have reviewed all study information here: www.abc.de and agree to it.
06.04.19, 11:49 - Max: I do not agree! Please delete all my data!
06.04.19, 10:49 - Frank: All right, thank you all!
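The consent-checking logic described above can be sketched in a few lines of base R. This is an illustration of the idea, not WhatsR's implementation: the `check_consent()` helper, the English consent string, and the toy chat are hypothetical, and the sketch assumes that a participant counts as consenting if any of their messages contains the consent string verbatim.

```r
# Hypothetical predefined consent string chosen by the researchers
consent_string <- "I agree to donate my chat data to study XYZ at University ABC"

# Hypothetical parsed chat mirroring the example conversation
chat <- data.frame(
  Sender = c("Frank", "Frank", "Bob", "Elli", "Max", "Frank"),
  Message = c(
    "Hi everyone, please repost the following text if you agree.",
    paste0(consent_string, "."),
    paste0(consent_string, "."),
    paste0(consent_string, "."),
    "I do not agree! Please delete all my data!",
    "All right, thank you all!"
  ),
  stringsAsFactors = FALSE
)

check_consent <- function(chat, consent_string) {
  # Find messages that contain the consent string verbatim
  posted <- grepl(consent_string, chat$Message, fixed = TRUE)
  # Everyone who posted it at least once has opted in
  consenting <- unique(chat$Sender[posted])
  # Keep only rows from consenting senders; all other content is dropped
  chat[chat$Sender %in% consenting, , drop = FALSE]
}

kept <- check_consent(chat, consent_string)
print(kept)
```

In this toy chat, all of Max's messages are dropped, including the one in which he declines. Opting out therefore requires no action beyond not reposting the string.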
WhatsR can be used to parse, anonymize and check consent for donated WhatsApp chat logs.
Steps to set up the ChatDashboard for a data donation study:
Congrats, you can now use ChatDashboard for data donations!
ChatDashboard is an R Shiny web app. It can be modified and customized without knowledge of web development frameworks such as React, Angular, or Vue.js.
The processing of data is handled by WhatsR and can be “easily” modified or extended with your own code
ChatDashboard uses SSL encryption for data in transit and RSA encryption for data at rest
Researchers generate their own RSA keys, so they are the only ones able to access the data after it’s encrypted
Researchers can host the ChatDashboard on their own server, enabling them to add additional layers of encryption if desired
ChatDashboard has an interactive interface that allows participants to manually remove data before donation
Participants can remove whole columns or individual messages
Consent checking and anonymization are done after manual removal and are not affected by participants' choices
Researchers can quantify later how much data was removed by participants
A participant ID can be forwarded to ChatDashboard via a URL parameter
The participant ID then becomes a valid username for logging in to ChatDashboard
The participant ID is attached to the data donation in the file name and can be used to link the chat to the corresponding survey data
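The linking mechanism can be sketched in base R as follows. The parameter name `id`, the helpers `get_query_param()` and `donation_filename()`, and the file naming scheme are hypothetical and not taken from ChatDashboard. Inside a Shiny app, the query string would typically come from `session$clientData$url_search` and could be parsed with `shiny::parseQueryString()` instead of the hand-rolled helper used here to keep the example self-contained.

```r
# Extract one parameter from a URL query string such as "?id=P042&lang=de"
get_query_param <- function(query, name) {
  pairs <- strsplit(sub("^\\?", "", query), "&", fixed = TRUE)[[1]]
  kv    <- strsplit(pairs, "=", fixed = TRUE)
  keys  <- vapply(kv, function(p) p[1], character(1))
  vals  <- vapply(kv, function(p) {
    if (length(p) > 1) utils::URLdecode(p[2]) else ""
  }, character(1))
  hit <- match(name, keys)
  if (is.na(hit)) NA_character_ else vals[hit]
}

# Build a file name that carries the participant ID (hypothetical scheme);
# characters other than letters, digits, "_" and "-" are stripped
donation_filename <- function(participant_id, ext = ".rds") {
  safe_id <- gsub("[^A-Za-z0-9_-]", "", participant_id)
  paste0("donation_", safe_id, ext)
}

# Hypothetical link as it might be generated by a survey tool
participant_id <- get_query_param("?id=P042&lang=de", "id")
print(donation_filename(participant_id))
```

Stripping unexpected characters from the ID before using it in a file name is a cautious default, since the value arrives from the URL and is therefore user-controlled.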
© 2025 GESIS - Leibniz Institute for the Social Sciences